Optimal word sizes for dissimilarity measures and estimation of the degree of dissimilarity between DNA sequences

Authors

  • Tiee-Jian Wu
  • Ying-Hsueh Huang
  • Lung-An Li

Abstract

MOTIVATION Several measures of DNA sequence dissimilarity have been developed. The purpose of this paper is threefold. Firstly, we compare the performance of several word-based or alignment-based methods. Secondly, we give a general guideline for choosing the window size and determine the optimal word sizes for several word-based measures at different window sizes. Thirdly, we use a large-scale simulation method to simulate data from the distribution of SK-LD (symmetric Kullback-Leibler discrepancy). These simulated data can be used to estimate the degree of dissimilarity beta between any pair of DNA sequences.

RESULTS Our study shows that (1) for whole-sequence similarity/dissimilarity identification, the window size should be as large as possible, but probably no larger than 3000, as restricted by CPU time in practice; (2) for each measure, the optimal word size increases with window size; (3) when the optimal word size is used, SK-LD performance is superior in both simulation and real data analysis; (4) the estimate of beta based on SK-LD can be used to quickly filter out a large number of dissimilar sequences and speed up alignment-based database searches for similar sequences; and (5) this estimate is also applicable in local similarity comparison situations. For example, it can help in selecting oligo probes with high specificity and, therefore, has potential in probe design for microarrays.

AVAILABILITY The SK-LD algorithm, the estimate of beta and the simulation software are implemented in MATLAB code and are available at http://www.stat.ncku.edu.tw/tjwu
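To make the idea of a word-based measure concrete, the following is a minimal Python sketch of a symmetric Kullback-Leibler discrepancy between the k-word frequency profiles of two DNA sequences. It is an illustration under stated assumptions, not the authors' MATLAB implementation: the function names `word_frequencies` and `symmetric_kl`, the default word size `k=3`, the pseudocount used to avoid zero frequencies, and the specific form sum((p - q) * log(p / q)) are all choices made here, since the abstract does not give the exact definition.

```python
from collections import Counter
from itertools import product
from math import log


def word_frequencies(seq, k, pseudocount=1.0):
    """Relative frequencies of all 4**k overlapping k-words in seq.

    The pseudocount is an assumption of this sketch; it keeps every
    frequency positive so the logarithm below is always defined.
    Words containing characters other than A, C, G, T are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    words = ["".join(w) for w in product("ACGT", repeat=k)]
    total = sum(counts[w] for w in words) + pseudocount * len(words)
    return {w: (counts[w] + pseudocount) / total for w in words}


def symmetric_kl(seq_a, seq_b, k=3, pseudocount=1.0):
    """Symmetric KL discrepancy, KL(p||q) + KL(q||p), over k-word profiles."""
    p = word_frequencies(seq_a, k, pseudocount)
    q = word_frequencies(seq_b, k, pseudocount)
    return sum((p[w] - q[w]) * log(p[w] / q[w]) for w in p)
```

In this sketch the value is zero for identical word profiles and grows as the profiles diverge. The word size `k` plays the role of the word size discussed in the abstract, and applying the function to subsequences of a fixed length would correspond to choosing a window size.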

Similar resources

Optimal Word Sizes for Dissimilarity Measures and Estimation of the Degree of Dissimilarity Between DNA Sequences Running Head: Optimal word size and degree of dissimilarity

Motivation: Several measures of DNA sequence dissimilarity have been developed. The purpose of this paper is threefold. Firstly, we compare the performance of several word-based or alignment-based methods. Secondly, we give a general guideline for choosing the window size and determine the optimal word sizes for several word-based measures at different window sizes. Thirdly, we use a large-scal...

Statistical measures of DNA sequence dissimilarity under Markov chain models of base composition.

In molecular biology, the issue of quantifying the similarity between two biological sequences is very important. Past research has shown that word-based search tools are computationally efficient and can find some new functional similarities or dissimilarities invisible to other algorithms like FASTA. Recently, under the independent model of base composition, Wu, Burke, and Davison (1997, Biom...

A comprehensive experimental comparison of the aggregation techniques for face recognition

In face recognition, one of the most important problems to tackle is a large amount of data and the redundancy of information contained in facial images. There are numerous approaches attempting to reduce this redundancy. One of them is information aggregation based on the results of classifiers built on selected facial areas being the most salient regions from the point of view of classificati...

Design of Dissimilarity Measures: A New Dissimilarity Between Species Distribution Areas

In many situations, dissimilarities between objects cannot be measured directly, but have to be constructed from some known characteristics of the objects of interest, e.g. some values on certain variables. From a philosophical point of view, the assumption of the objective existence of a “true” but not directly observable dissimilarity value between two objects is highly questionable. Therefor...

A Normalized Parameter for Similarity/Dissimilarity Characterization of Sequences

Abstract. We propose a normalized parameter for characterization of similarity/dissimilarity of two sequences providing a smoothly varying measure for varying symmetry score. Such a parameter can be used for analysis of experimental data and fitting to a theoretical model, mirror symmetry estimation with respect to a selected or presumed symmetry axis, in particular, in symmetry detection applic...

Journal:
  • Bioinformatics

Volume 21, Issue 22

Pages: -

Publication date: 2005